A Market Segmentation and Purchase Drivers Process

Hicham Aber


\clearpage



The Data

First we load the data to use (see the raw .Rmd file to change the data file as needed):

# Please ENTER the name of the file with the data used. The file should be a
# .csv with one row per observation (e.g. person) and one column per
# attribute. Include the path, and make sure the data are numeric.
datafile_name = "./data/Speed.csv"

# Please ENTER the threshold below which numbers are not printed - this
# makes the tables easier to read. Default values are either 10e6 (to print
# everything) or 0.5. Try both to see the difference.
MIN_VALUE = 0.5

# Please enter the maximum number of observations to show in the report and
# slides.  DEFAULT is 10. If the number is large the report may be slow.
max_data_report = 10
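The loading step itself is simple. Here is a hedged, self-contained sketch of it (the tiny data frame and temporary file below are stand-ins so the sketch runs anywhere; in the report, `datafile_name` points to the file configured above):

```r
# Self-contained stand-in for the loading step: write a tiny .csv first
# so the sketch runs anywhere (the real report reads datafile_name).
tmp_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, gender = c(0, 1, 0)), tmp_file, row.names = FALSE)

ProjectData <- read.csv(tmp_file)        # one row per observation
ProjectData <- data.matrix(ProjectData)  # make sure the data are numeric
```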


\clearpage

# Please ENTER the original raw attributes to use.  Please use numbers, not
# column names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
factor_attributes_used = c(2:158)

# Please ENTER the selection criterion for the factors to use.  Choices:
# 'eigenvalue', 'variance', 'manual'
factor_selectionciterion = "manual"

# Please ENTER the desired minimum variance explained (Only used in case
# 'variance' is the factor selection criterion used).
minimum_variance_explained = 60  # between 1 and 100

# Please ENTER the number of factors to use (Only used in case 'manual' is
# the factor selection criterion used).
manual_numb_factors_used = 10

# Please ENTER the rotation eventually used (e.g. 'none', 'varimax',
# 'quartimax', 'promax', 'oblimin', 'simplimax', and 'cluster' - see
# help(principal)). Default is 'varimax'
rotation_used = "varimax"

<!--html_preserve-->

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
id 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
gender 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
idg 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
condtn 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
wave 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
round 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00
position 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00
positin1
order 4.00 3.00 10.00 5.00 7.00 6.00 1.00 2.00 8.00 9.00
partner 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00
pid 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00
match 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00
int_corr 0.14 0.54 0.16 0.61 0.21 0.25 0.34 0.50 0.28 -0.36
samerace 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
age_o 27.00 22.00 22.00 23.00 24.00 25.00 30.00 27.00 28.00 24.00
race_o 2.00 2.00 4.00 2.00 3.00 2.00 2.00 2.00 2.00 2.00
pf_o_att 35.00 60.00 19.00 30.00 30.00 50.00 35.00 33.33 50.00 100.00
pf_o_sin 20.00 0.00 18.00 5.00 10.00 0.00 15.00 11.11 0.00 0.00
pf_o_int 20.00 0.00 19.00 15.00 20.00 30.00 25.00 11.11 25.00 0.00
pf_o_fun 20.00 40.00 18.00 40.00 10.00 10.00 10.00 11.11 10.00 0.00
pf_o_amb 0.00 0.00 14.00 5.00 10.00 0.00 5.00 11.11 0.00 0.00
pf_o_sha 5.00 0.00 12.00 5.00 20.00 10.00 10.00 22.22 15.00 0.00
dec_o 0.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 0.00
attr_o 6.00 7.00 10.00 7.00 8.00 7.00 3.00 6.00 7.00 6.00
sinc_o 8.00 8.00 10.00 8.00 7.00 7.00 6.00 7.00 7.00 6.00
intel_o 8.00 10.00 10.00 9.00 9.00 8.00 7.00 5.00 8.00 6.00
fun_o 8.00 7.00 10.00 8.00 6.00 8.00 5.00 6.00 8.00 6.00
amb_o 8.00 7.00 10.00 9.00 9.00 7.00 8.00 8.00 8.00 6.00
shar_o 6.00 5.00 10.00 8.00 7.00 7.00 7.00 6.00 9.00 6.00
like_o 7.00 8.00 10.00 7.00 8.00 7.00 2.00 7.00 6.50 6.00
prob_o 4.00 4.00 10.00 7.00 6.00 6.00 1.00 5.00 8.00 6.00
met_o 2.00 2.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00
age 21.00 21.00 21.00 21.00 21.00 21.00 21.00 21.00 21.00 21.00
field 152.00 152.00 152.00 152.00 152.00 152.00 152.00 152.00 152.00 152.00
field_cd 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
undergra 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
mn_sat 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
tuition 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
race 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00
imprace 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00
imprelig 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00
from 56.00 56.00 56.00 56.00 56.00 56.00 56.00 56.00 56.00 56.00
zipcode 262.00 262.00 262.00 262.00 262.00 262.00 262.00 262.00 262.00 262.00
income 239.00 239.00 239.00 239.00 239.00 239.00 239.00 239.00 239.00 239.00
goal 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00
date 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00
go_out 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
career 185.00 185.00 185.00 185.00 185.00 185.00 185.00 185.00 185.00 185.00
career_c
sports 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00
tvsports 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00
exercise 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
dining 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00
museums 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
art 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
hiking 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00
gaming 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
clubbing 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00
reading 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00
tv 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00
theater 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
movies 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00
concerts 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00 10.00
music 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00 9.00
shopping 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
yoga 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
exphappy 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00
expnum 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00
attr1_1 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00
sinc1_1 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00
intel1_1 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00
fun1_1 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00
amb1_1 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00
shar1_1 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00
attr4_1
sinc4_1
intel4_1
fun4_1
amb4_1
shar4_1
attr2_1 35.00 35.00 35.00 35.00 35.00 35.00 35.00 35.00 35.00 35.00
sinc2_1 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00
intel2_1 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00 15.00
fun2_1 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00 20.00
amb2_1 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00
shar2_1 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00
attr3_1 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00
sinc3_1 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
fun3_1 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
intel3_1 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
amb3_1 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00
attr5_1
sinc5_1
intel5_1
fun5_1
amb5_1
dec 1.00 1.00 1.00 1.00 1.00 0.00 1.00 0.00 1.00 1.00
attr 6.00 7.00 5.00 7.00 5.00 4.00 7.00 4.00 7.00 5.00
sinc 9.00 8.00 8.00 6.00 6.00 9.00 6.00 9.00 6.00 6.00
intel 7.00 7.00 9.00 8.00 7.00 7.00 7.00 7.00 8.00 6.00
fun 7.00 8.00 8.00 7.00 7.00 4.00 4.00 6.00 9.00 8.00
amb 6.00 5.00 5.00 6.00 6.00 6.00 6.00 5.00 8.00 10.00
shar 5.00 6.00 7.00 8.00 6.00 4.00 7.00 6.00 8.00 8.00
like 7.00 7.00 7.00 7.00 6.00 6.00 6.00 6.00 7.00 6.00
prob 6.00 5.00 6.00 6.00 5.00 5.00 7.00 7.00 6.00
met 2.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 2.00
match_es 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00
attr1_s
sinc1_s
intel1_s
fun1_s
amb1_s
shar1_s
attr3_s
sinc3_s
intel3_s
fun3_s
amb3_s
satis_2 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00
length 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00
numdat_2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
attr7_2
sinc7_2
intel7_2
fun7_2
amb7_2
shar7_2
attr1_2 19.44 19.44 19.44 19.44 19.44 19.44 19.44 19.44 19.44 19.44
sinc1_2 16.67 16.67 16.67 16.67 16.67 16.67 16.67 16.67 16.67 16.67
intel1_2 13.89 13.89 13.89 13.89 13.89 13.89 13.89 13.89 13.89 13.89
fun1_2 22.22 22.22 22.22 22.22 22.22 22.22 22.22 22.22 22.22 22.22
amb1_2 11.11 11.11 11.11 11.11 11.11 11.11 11.11 11.11 11.11 11.11
shar1_2 16.67 16.67 16.67 16.67 16.67 16.67 16.67 16.67 16.67 16.67
attr4_2
sinc4_2
intel4_2
fun4_2
amb4_2
shar4_2
attr2_2
sinc2_2
intel2_2
fun2_2
amb2_2
shar2_2
attr3_2 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00
sinc3_2 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00
intel3_2 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
fun3_2 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00
amb3_2 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00 6.00
attr5_2
sinc5_2
intel5_2
fun5_2
amb5_2
you_call 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
them_cal 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
<!--/html_preserve-->

The data we use here have the following descriptive statistics:

iprint.df(round(my_summary(ProjectDataFactor), 2))

Step 3: Check Correlations

This is the correlation matrix of the customer responses to the 157 attitude questions - which are the only questions that we will use for the segmentation (see the case):

<!--html_preserve-->

(Correlation matrix omitted: the matrix of all 157 attributes is too wide to render in this report, and its rows were truncated when the document was knitted. Re-knit the raw .Rmd, or print a small subset of the columns at a time, to inspect it.)

<!--/html_preserve-->
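Such a correlation table can be produced along these lines with base R (a sketch on stand-in columns, not the real survey data; pairwise-complete deletion is used so that columns with a few missing values can still be correlated, and small correlations are suppressed as the MIN_VALUE option above does):

```r
set.seed(1)
# Stand-in for the attitude columns (the report uses ProjectDataFactor);
# one missing value shows why pairwise deletion is needed.
dat <- matrix(rnorm(100), ncol = 5, dimnames = list(NULL, paste0("q", 1:5)))
dat[3, 2] <- NA

# Pairwise-complete correlations, rounded for readability
cors <- round(cor(dat, use = "pairwise.complete.obs"), 2)

# Suppress small correlations to make the table easier to scan
MIN_VALUE <- 0.5
cors_thres <- cors
cors_thres[abs(cors_thres) < MIN_VALUE] <- NA
```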

Questions

  1. Do you see any high correlations between the responses? Do they make sense?
  2. What do these correlations imply?

Answers:

* * * * * * * * * *

Step 4: Choose number of factors

Clearly the survey asked many redundant questions (can you think of some reasons why?), so we may be able to "group" these 157 attitude questions into only a few "key factors". This will not only simplify the data, but also greatly facilitate our understanding of the customers.

To do so, we use methods called principal component analysis (PCA) and factor analysis, as also discussed in the Dimensionality Reduction readings. We can use two different R commands for this (they make slightly different information easily available as output): the command principal (check help(principal) from R package psych), and the command PCA from R package FactoMineR - there are more packages and commands for these methods, as they are very widely used.

## Warning in cor(r, use = "pairwise"): the standard deviation is zero

Likely variables with missing values are expnum

## Error in principal(ProjectDataFactor, nfactors = ncol(ProjectDataFactor), : I am sorry: missing values (NAs) in the correlation matrix do not allow me to continue.
## Please drop those variables and try again.
## Error in eval(expr, envir, enclos): object 'UnRotated_Results' not found
## Error in as.data.frame(unclass(UnRotated_Factors)): object 'UnRotated_Factors' not found
## Error in ncol(UnRotated_Factors): object 'UnRotated_Factors' not found
## Warning in PCA(ProjectDataFactor, graph = FALSE): Missing values are
## imputed by the mean of the variable: you should use the imputePCA function
## of the missMDA package
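The errors above are caused by columns that have missing values (the output flags expnum) or zero variance, both of which make the correlation matrix undefined. A hedged cleanup sketch (column names are stand-ins for the problematic columns):

```r
dat <- data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3),
                  expnum = c(NA, NA, NA, NA),  # all missing, like expnum here
                  const = c(1, 1, 1, 1))       # zero variance

keep_na  <- colMeans(is.na(dat)) < 0.5         # drop mostly-missing columns
keep_var <- sapply(dat, function(x) isTRUE(sd(x, na.rm = TRUE) > 0))

dat_clean <- dat[, keep_na & keep_var, drop = FALSE]
# principal(dat_clean, ...) can now build its correlation matrix
```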

Let's look at the variance explained as well as the eigenvalues (see session readings):

<!--html_preserve-->

Eigenvalue Pct of explained variance Cumulative pct of explained variance
Component 1 8.58 5.47 5.47
Component 2 8.21 5.23 10.70
Component 3 6.61 4.21 14.91
Component 4 5.51 3.51 18.42
Component 5 4.75 3.02 21.44
Component 6 4.06 2.59 24.03
Component 7 3.95 2.51 26.55
Component 8 3.66 2.33 28.88
Component 9 3.43 2.19 31.07
Component 10 3.28 2.09 33.16
Component 11 3.12 1.99 35.15
Component 12 2.84 1.81 36.95
Component 13 2.58 1.64 38.60
Component 14 2.47 1.57 40.17
Component 15 2.21 1.41 41.58
Component 16 2.05 1.30 42.88
Component 17 2.01 1.28 44.16
Component 18 1.93 1.23 45.39
Component 19 1.84 1.17 46.56
Component 20 1.77 1.13 47.69
Component 21 1.73 1.10 48.79
Component 22 1.71 1.09 49.88
Component 23 1.63 1.04 50.92
Component 24 1.59 1.01 51.93
Component 25 1.58 1.01 52.94
Component 26 1.48 0.95 53.89
Component 27 1.47 0.93 54.82
Component 28 1.45 0.92 55.74
Component 29 1.40 0.89 56.63
Component 30 1.35 0.86 57.50
Component 31 1.34 0.85 58.35
Component 32 1.30 0.83 59.18
Component 33 1.27 0.81 59.98
Component 34 1.26 0.80 60.78
Component 35 1.23 0.79 61.57
Component 36 1.19 0.76 62.33
Component 37 1.17 0.75 63.08
Component 38 1.15 0.73 63.81
Component 39 1.14 0.72 64.53
Component 40 1.12 0.71 65.24
Component 41 1.09 0.69 65.94
Component 42 1.07 0.68 66.62
Component 43 1.07 0.68 67.30
Component 44 1.03 0.66 67.96
Component 45 1.03 0.66 68.61
Component 46 1.02 0.65 69.26
Component 47 0.99 0.63 69.89
Component 48 0.98 0.62 70.51
Component 49 0.96 0.61 71.12
Component 50 0.95 0.61 71.73
Component 51 0.93 0.59 72.32
Component 52 0.92 0.59 72.91
Component 53 0.90 0.58 73.48
Component 54 0.89 0.57 74.05
Component 55 0.89 0.57 74.62
Component 56 0.86 0.55 75.17
Component 57 0.86 0.55 75.72
Component 58 0.84 0.54 76.25
Component 59 0.84 0.54 76.79
Component 60 0.83 0.53 77.32
Component 61 0.81 0.52 77.83
Component 62 0.79 0.50 78.33
Component 63 0.78 0.50 78.83
Component 64 0.76 0.49 79.32
Component 65 0.76 0.48 79.80
Component 66 0.74 0.47 80.27
Component 67 0.73 0.47 80.74
Component 68 0.72 0.46 81.20
Component 69 0.71 0.46 81.65
Component 70 0.70 0.45 82.10
Component 71 0.69 0.44 82.54
Component 72 0.67 0.43 82.96
Component 73 0.66 0.42 83.39
Component 74 0.65 0.41 83.80
Component 75 0.64 0.41 84.21
Component 76 0.63 0.40 84.61
Component 77 0.61 0.39 84.99
Component 78 0.60 0.38 85.38
Component 79 0.60 0.38 85.76
Component 80 0.58 0.37 86.13
Component 81 0.57 0.36 86.49
Component 82 0.56 0.36 86.85
Component 83 0.55 0.35 87.20
Component 84 0.54 0.35 87.55
Component 85 0.54 0.35 87.89
Component 86 0.54 0.34 88.23
Component 87 0.53 0.34 88.57
Component 88 0.52 0.33 88.90
Component 89 0.52 0.33 89.23
Component 90 0.51 0.32 89.56
Component 91 0.49 0.31 89.87
Component 92 0.48 0.31 90.17
Component 93 0.47 0.30 90.48
Component 94 0.47 0.30 90.77
Component 95 0.45 0.29 91.06
Component 96 0.44 0.28 91.34
Component 97 0.43 0.27 91.62
Component 98 0.42 0.27 91.89
Component 99 0.42 0.27 92.15
Component 100 0.41 0.26 92.42
Component 101 0.41 0.26 92.68
Component 102 0.40 0.25 92.93
Component 103 0.39 0.25 93.18
Component 104 0.38 0.24 93.42
Component 105 0.37 0.24 93.65
Component 106 0.37 0.23 93.89
Component 107 0.36 0.23 94.11
Component 108 0.35 0.22 94.33
Component 109 0.34 0.21 94.55
Component 110 0.33 0.21 94.75
Component 111 0.32 0.21 94.96
Component 112 0.32 0.20 95.16
Component 113 0.31 0.20 95.36
Component 114 0.30 0.19 95.55
Component 115 0.30 0.19 95.74
Component 116 0.29 0.19 95.93
Component 117 0.29 0.18 96.11
Component 118 0.28 0.18 96.29
Component 119 0.28 0.18 96.47
Component 120 0.27 0.17 96.64
Component 121 0.26 0.17 96.81
Component 122 0.26 0.17 96.98
Component 123 0.26 0.16 97.14
Component 124 0.25 0.16 97.30
Component 125 0.25 0.16 97.46
Component 126 0.24 0.15 97.61
Component 127 0.23 0.15 97.76
Component 128 0.23 0.14 97.90
Component 129 0.22 0.14 98.04
Component 130 0.22 0.14 98.18
Component 131 0.21 0.13 98.31
Component 132 0.20 0.13 98.44
Component 133 0.20 0.13 98.57
Component 134 0.19 0.12 98.69
Component 135 0.19 0.12 98.81
Component 136 0.18 0.12 98.93
Component 137 0.17 0.11 99.03
Component 138 0.16 0.10 99.14
Component 139 0.15 0.10 99.23
Component 140 0.15 0.10 99.33
Component 141 0.14 0.09 99.42
Component 142 0.13 0.08 99.50
Component 143 0.12 0.08 99.58
Component 144 0.11 0.07 99.65
Component 145 0.11 0.07 99.72
Component 146 0.10 0.06 99.78
Component 147 0.09 0.06 99.84
Component 148 0.08 0.05 99.89
Component 149 0.06 0.04 99.93
Component 150 0.05 0.03 99.96
Component 151 0.02 0.01 99.97
Component 152 0.02 0.01 99.98
Component 153 0.02 0.01 99.99
Component 154 0.01 0.00 100.00
Component 155 0.01 0.00 100.00
Component 156 0.00 0.00 100.00
Component 157 0.00 0.00 100.00
<!--/html_preserve-->
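The quantities in the table above (eigenvalues, percent and cumulative percent of explained variance) can be reproduced with base R along these lines - a sketch on synthetic stand-in data, not the report's own helper functions:

```r
set.seed(42)
X <- matrix(rnorm(200), ncol = 4)          # stand-in for ProjectDataFactor
pca <- prcomp(X, scale. = TRUE)

eigenvalues   <- pca$sdev^2                # variance of each component
pct_explained <- 100 * eigenvalues / sum(eigenvalues)
cum_pct       <- cumsum(pct_explained)     # cumulative pct of explained variance
```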

## Error in loadNamespace(name): there is no package called 'webshot'

Questions:

  1. Can you explain what this table and the plot are? What do they indicate? What can we learn from these?
  2. Why does the plot have this specific shape? Could the plotted line be increasing?
  3. What characteristics of these results would we prefer to see? Why?

Answers

* * * * * * * * * *

Step 5: Interpret the factors

Let's now see what the "top factors" look like.

To better visualize them, we will use what is called a "rotation". There are many rotation methods; in this case we selected the varimax rotation. For our data, the 10 selected factors look as follows after this rotation:

## Warning in cor(r, use = "pairwise"): the standard deviation is zero

Likely variables with missing values are expnum

## Error in principal(ProjectDataFactor, nfactors = max(factors_selected), : I am sorry: missing values (NAs) in the correlation matrix do not allow me to continue.
## Please drop those variables and try again.
## Error in eval(expr, envir, enclos): object 'Rotated_Results' not found
## Error in as.data.frame(unclass(Rotated_Factors)): object 'Rotated_Factors' not found
## Error in ncol(Rotated_Factors): object 'Rotated_Factors' not found
## Error in sort(Rotated_Factors[, 1], decreasing = TRUE, index.return = TRUE): object 'Rotated_Factors' not found
## Error in eval(expr, envir, enclos): object 'Rotated_Factors' not found
## Error in iprint.df(Rotated_Factors, scale = TRUE): object 'Rotated_Factors' not found
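Once the data are clean, the rotation step can be sketched with base R's varimax() (psych::principal(..., rotate = "varimax") wraps the same idea; all the data and names below are stand-ins):

```r
set.seed(11)
X <- scale(matrix(rnorm(300), ncol = 6))         # stand-in survey columns
pca <- prcomp(X)

k <- 2                                           # number of factors to keep
raw_loadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])

rot <- varimax(raw_loadings)                     # stats::varimax
Rotated_Factors <- unclass(rot$loadings)
```

The rotation does not change how much variance the factors explain; it only redistributes the loadings to make each question load strongly on as few factors as possible.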

To better visualize and interpret the factors we often "suppress" loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:

## Error in eval(expr, envir, enclos): object 'Rotated_Factors' not found
## Error in Rotated_Factors_thres[abs(Rotated_Factors_thres) < MIN_VALUE] <- NA: object 'Rotated_Factors_thres' not found
## Error in is.data.frame(x): object 'Rotated_Factors' not found
## Error in rownames(Rotated_Factors): object 'Rotated_Factors' not found
## Error in iprint.df(Rotated_Factors_thres, scale = TRUE): object 'Rotated_Factors_thres' not found
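The suppression step itself is one line; a sketch with a small stand-in loadings matrix (the report stores the result in Rotated_Factors_thres):

```r
# Stand-in rotated loadings matrix (rows = questions, columns = factors)
Rotated_Factors <- matrix(c(0.8, 0.1, -0.6, 0.3), ncol = 2,
                          dimnames = list(c("q1", "q2"), c("F1", "F2")))
MIN_VALUE <- 0.5

# Replace loadings below MIN_VALUE in absolute value with NA (blank in print)
Rotated_Factors_thres <- Rotated_Factors
Rotated_Factors_thres[abs(Rotated_Factors_thres) < MIN_VALUE] <- NA
```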

Questions

  1. What do the first couple of factors mean? Do they make business sense?
  2. How many factors should we choose for this data/customer base? Please try a few and explain your final choice based on a) statistical arguments, b) on interpretation arguments, c) on business arguments (you need to consider all three types of arguments)
  3. How would you interpret the factors you selected?
  4. What lessons about data science do you learn when doing this analysis? Please comment.

Answers

*nature lovers, status symbol, utility/adventure, price consciousness, experienced, brand sensitivity, DIY * * * * * * * * *

Step 6: Save factor scores

We can now either replace all the initial variables used in this part with the factor scores, or select one of the initial variables for each of the selected factors to represent that factor. Here are the factor scores for the first few respondents:

## Error in eval(expr, envir, enclos): object 'Rotated_Results' not found
## Error in ncol(NEW_ProjectData): object 'NEW_ProjectData' not found
## Error in head(NEW_ProjectData, 10): object 'NEW_ProjectData' not found
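Factor scores are one value per respondent per factor. With base R PCA they come from pca$x (psych::principal returns the analogous rotated scores in $scores); a sketch on stand-in data:

```r
set.seed(3)
X <- scale(matrix(rnorm(250), ncol = 5))   # 50 stand-in respondents
pca <- prcomp(X)

manual_numb_factors_used <- 2
NEW_ProjectData <- pca$x[, 1:manual_numb_factors_used]  # one column per factor
head(round(NEW_ProjectData, 2), 10)        # first few respondents
```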

Questions

  1. Can you describe some of the people using the new derived variables (factor scores)?
  2. Which of the 157 initial variables would you select to represent each of the factors you selected?

Answers

* * * * * * * * * *


\clearpage

Part 2: Customer Segmentation

The code used here is along the lines of the code in the session 5-6 reading ClusterAnalysisReading.Rmd. We follow the process described in the Cluster Analysis reading.

In this part we also become familiar with:

  1. Some clustering methods;
  2. How these tools can be used in practice.

A key family of methods used for segmentation is clustering. Clustering is a very important problem in statistics and machine learning, used in all sorts of applications, such as Amazon's pioneering work on recommender systems. There are many mathematical methods for clustering. We will use two very standard ones, hierarchical clustering and k-means. While the "math" behind these methods can be complex, the R functions that implement them are relatively simple to use, as we will see.

(All user inputs for this part should be selected in the code chunk in the raw .Rmd file)

# Please ENTER the original raw attributes to use for the segmentation (the
# 'segmentation attributes'). Please use numbers, not column names, e.g.
# c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
segmentation_attributes_used = c(4, 5, 6, 7, 11, 12, 13, 14, 15, 16, 17, 18, 
    19)  #c(10,19,5,12,3) 

# Please ENTER the original raw attributes to use for the profiling of the
# segments (the 'profiling attributes'). Please use numbers, not column
# names, e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
profile_attributes_used = c(2:82)

# Please ENTER the number of clusters to eventually use for this report
numb_clusters_used = 7  # for boats possibly use 5, for Mall_Visits use 3

# Please enter the method to use for the segmentation:
profile_with = "hclust"  #  'hclust' or 'kmeans'

# Please ENTER the distance metric eventually used for the clustering in
# case of hierarchical clustering (e.g. 'euclidean', 'maximum', 'manhattan',
# 'canberra', 'binary' or 'minkowski' - see help(dist)).  DEFAULT is
# 'euclidean'
distance_used = "euclidean"

# Please ENTER the hierarchical clustering method to use (options are:
# 'ward', 'single', 'complete', 'average', 'mcquitty', 'median' or
# 'centroid').  DEFAULT is 'ward'
hclust_method = "ward.D"

# Please ENTER the kmeans clustering method to use (options are:
# 'Hartigan-Wong', 'Lloyd', 'Forgy', 'MacQueen').  DEFAULT is 'Lloyd'
kmeans_method = "Lloyd"
## Error in if (sd(r) != 0) (r - mean(r))/sd(r) else 0 * r: missing value where TRUE/FALSE needed

Steps 1-2: Explore the data

(This was done above, so we skip it)

Step 3. Select Segmentation Variables

For simplicity we will use one representative question for each of the factors we found in Part 1 (we could also use the "factor scores" of each respondent) to represent our survey respondents. These are the segmentation_attributes_used selected above. We can choose the question with the highest absolute factor loading for each factor. For example, in the original Boats case, when using 5 factors with the varimax rotation one can select questions Q.1.9 (I see my boat as a status symbol), Q1.18 (Boating gives me a feeling of adventure), Q1.4 (I only consider buying a boat from a reputable brand), Q1.11 (I tend to perform minor boat repairs and maintenance on my own) and Q1.2 (When buying a boat getting the lowest price is more important than the boat brand) - columns 10, 19, 5, 12, and 3, respectively, of the data matrix ProjectData. Try the same approach here.
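Picking the representative question for each factor can be automated: take, per factor, the question with the largest absolute loading. A sketch with a stand-in loadings matrix:

```r
# Stand-in rotated loadings: rows = questions, columns = factors
loadings <- matrix(c(0.8, 0.2, 0.1,        # factor 1
                     0.1, -0.7, 0.6),      # factor 2
                   ncol = 2,
                   dimnames = list(c("q1", "q2", "q3"), c("F1", "F2")))

# For each factor, the question with the highest absolute loading
representatives <- apply(abs(loadings), 2, which.max)
rownames(loadings)[representatives]        # one question per factor
```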

Step 4: Define similarity measure

We need to define a distance metric that measures how different people (observations in general) are from each other. This can be an important choice. Here are the pairwise distances between the first 10 observations using the distance metric we selected:

<!--html_preserve-->

Obs.01 Obs.02 Obs.03 Obs.04 Obs.05 Obs.06 Obs.07 Obs.08 Obs.09 Obs.10
Obs.01 0
Obs.02 32 0
Obs.03 17 45 0
Obs.04 17 31 17 0
Obs.05 13 32 14 5 0
Obs.06 26 12 36 21 22 0
Obs.07 10 31 19 14 10 22 0
Obs.08 13 31 18 10 6 20 5 0
Obs.09 27 15 37 22 23 5 22 20 0
Obs.10 69 42 84 71 71 50 67 68 50 0
<!--/html_preserve-->
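A sketch of this distance computation with base R (stand-in data; standardizing first keeps any single attribute from dominating the metric):

```r
set.seed(5)
seg_data <- matrix(rnorm(40), ncol = 4)    # stand-in segmentation attributes
distance_used <- "euclidean"

# Standardize, then compute all pairwise distances (see help(dist)
# for the other available metrics)
d <- dist(scale(seg_data), method = distance_used)
round(as.matrix(d)[1:5, 1:5], 1)           # top-left corner of the matrix
```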

Step 5: Visualize Pair-wise Distances

We can see the histograms of, say, the first 2 variables (can you change the code chunk in the raw .Rmd file to see other variables?):

<!--html_preserve-->

<!--/html_preserve-->

or the histogram of all pairwise distances for the euclidean distance:

## Error in loadNamespace(name): there is no package called 'webshot'

Step 6: Method and Number of Segments

We need to select the clustering method to use, as well as the number of clusters. It may be useful to see the dendrogram from hierarchical clustering, to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:

## Error in loadNamespace(name): there is no package called 'webshot'

We can also plot the "distance" traveled before we need to merge any of the smaller clusters into larger ones - the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers; we show the first 20 here.

## Error in loadNamespace(name): there is no package called 'webshot'
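In base R, the merge heights just described live in the $height component of an hclust object, and cutree() then yields the segment memberships (a sketch on stand-in data):

```r
set.seed(9)
seg_data <- matrix(rnorm(60), ncol = 3)    # 20 stand-in observations
hc <- hclust(dist(seg_data), method = "ward.D")

# n-1 merge heights, largest first (the report plots the first 20)
merge_heights <- rev(hc$height)

numb_clusters_used <- 7
Cluster_Membership <- cutree(hc, k = numb_clusters_used)
```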

Here is the segment membership of the first 10 respondents if we use hierarchical clustering:

<!--html_preserve-->

Observation Number Cluster_Membership
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
<!--/html_preserve-->

while this is the segment membership if we use k-means:

## Error in do_one(nmeth): NA/NaN/Inf in foreign function call (arg 1)
## Error in cbind(1:length(kmeans_clusters$cluster), kmeans_clusters$cluster): object 'kmeans_clusters' not found
## Error in colnames(ProjectData_with_kmeans_membership) <- c("Observation Number", : object 'ProjectData_with_kmeans_membership' not found
## Error in head(ProjectData_with_kmeans_membership, max_data_report): object 'ProjectData_with_kmeans_membership' not found
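The kmeans error above ("NA/NaN/Inf in foreign function call") occurs because kmeans() cannot handle missing values. One hedged remedy, sketched on stand-in data, is to impute (here, with column means) before clustering:

```r
set.seed(13)
seg_data <- matrix(rnorm(60), ncol = 3)    # 20 stand-in observations
seg_data[2, 1] <- NA                       # this alone breaks kmeans()

for (j in seq_len(ncol(seg_data))) {       # mean-impute each column
  miss <- is.na(seg_data[, j])
  seg_data[miss, j] <- mean(seg_data[, j], na.rm = TRUE)
}

kmeans_clusters <- kmeans(seg_data, centers = 3, algorithm = "Lloyd",
                          iter.max = 50, nstart = 10)
```

Mean imputation is the simplest option; dropping the incomplete rows or columns, or using a dedicated imputation package, may be preferable depending on how much data is missing.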

Step 7: Profile and interpret the segments

In market segmentation one may use variables for profiling the segments that are not (necessarily) the same as those used to segment the market: the latter may be, for example, attitude/needs related (you define segments based on what the customers "need"), while the former may be any information that allows a company to identify the defined customer segments (e.g. demographics, location, etc.). Of course, deciding which variables to use for segmentation and which to use for profiling (and then for activating the segmentation for business purposes) is largely subjective. In this case we use all survey questions for profiling for now - the profile_attributes_used variables selected above.

There are many ways to profile the segments. For example, here we show how the average answers of the respondents in each segment compare to the average answers of all respondents, using the ratio of the two. The idea is that if, in a segment, the average response to a question is very different from the overall average (i.e. the ratio is far from 1), then that question may tell us something about the segment relative to the total population.
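The ratio profiling just described can be sketched as follows (stand-in data and cluster memberships):

```r
set.seed(21)
profile_data <- matrix(rnorm(80, mean = 5), ncol = 4,
                       dimnames = list(NULL, paste0("q", 1:4)))
membership <- rep(1:2, each = 10)          # two stand-in segments

population_mean <- colMeans(profile_data)
segment_means <- aggregate(profile_data, by = list(segment = membership),
                           FUN = mean)

# Ratio of each segment's mean answer to the population mean:
# values far from 1 flag questions that characterize the segment
ratios <- sweep(as.matrix(segment_means[, -1]), 2, population_mean, "/")
```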

Here are, for example, the profiles of the segments using the clusters found above. First, let's see the average answer people gave to each question for each of the segments, as well as for the total population:

## Error in eval(expr, envir, enclos): object 'kmeans_clusters' not found
## Error in unique(cluster_memberships_kmeans): object 'cluster_memberships_kmeans' not found

<!--html_preserve-->

Population Seg.1 Seg.2 Seg.3 Seg.4 Seg.5 Seg.6 Seg.7
id 7.79 8.43 8.90 9.38 9.88 6.56
gender 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50
idg 17.33 14.98 16.20 17.30 18.14 19.23 12.35 22.50
condtn 1.83 1.60 1.83 1.69 1.87 2.00 1.74 2.00
wave 11.35 2.00 5.30 8.96 11.51 14.49 18.15 21.00
round 16.87 14.52 15.72 16.81 17.64 18.76 12.12 22.00
position 9.04 8.21 8.37 8.87 9.37 10.01 6.75 11.50
positin1 8.87 9.37 10.13 7.04 11.50
order 8.93 7.76 8.36 8.90 9.32 9.88 6.49 11.50
partner 8.96 7.79 8.44 8.88 9.39 9.88 6.56 11.50
pid 38.45 212.16 288.91 377.90 463.65 530.50
match 0.16 0.15 0.20 0.16 0.14 0.18 0.16 0.15
int_corr 0.19 0.23 0.18
samerace 0.40 0.40 0.40 0.38 0.38 0.50 0.33 0.35
age_o 26.63 26.93
race_o 2.76 2.97 2.98
pf_o_att 16.71 22.62 25.59
pf_o_sin 18.06 17.04 17.23
pf_o_int 19.05 20.30 20.41
pf_o_fun 17.56 18.03 17.68
pf_o_amb 14.08 10.35 7.45
pf_o_sha 14.54 11.63
dec_o 0.42 0.40 0.44 0.44 0.40 0.44 0.45 0.37
attr_o
sinc_o
intel_o
fun_o
amb_o
shar_o
like_o
prob_o
met_o
age 26.63 26.93
field 132.25 163.85 116.12 134.83 133.49 115.51 147.47 126.59
field_cd 8.54 8.45 7.73
undergra 70.44 1.00 1.00 15.78 123.20 106.26 125.90 120.36
mn_sat 17.91 1.00 1.00 5.77 33.38 28.15 25.65 28.34
tuition 27.84 1.00 1.00 8.51 46.66 44.44 49.47 43.18
race 2.76 2.97 2.98
imprace 4.59 3.03 4.09
imprelig 3.77 3.12 3.20
from 130.85 122.28 121.45 143.54 144.40 135.47 121.22 120.86
zipcode 165.33 191.18 167.59 150.81 120.97 181.20 183.04 178.48
income 68.78 64.03 79.24 83.16 55.05 72.04 69.15 57.11
goal 2.02 2.27 1.70
date 4.92 4.78 5.25
go_out 2.28 2.16 2.25
career 185.34 161.57 196.60 154.91 210.30 187.79 198.64 175.32
career_c 5.38 5.82 6.45
sports 6.57 6.46 6.30
tvsports 4.65 4.32 4.64
exercise 6.57 6.53 5.75
dining 7.88 8.14 7.98
museums 7.22 7.07 7.14
art 6.85 6.85 7.02
hiking 5.93 5.93 5.32
gaming 4.25 3.27 4.50
clubbing 6.20 5.65 6.36
reading 7.95 7.36 7.48
tv 5.26 5.41 5.64
theater 6.73 6.73 7.00
movies 7.78 7.79 8.07
concerts 6.45 6.89 7.09
music 7.92 7.82 7.93
shopping 5.47 6.03 5.89
yoga 4.34 4.97 4.11
exphappy 5.23 5.46
expnum
attr1_1 17.24 22.62 25.59
sinc1_1 17.93 17.04 17.23
intel1_1 18.92 20.30 20.41
fun1_1 17.45 18.03 17.68
amb1_1 14.02 10.35 7.45
shar1_1 14.45 11.63
attr4_1 11.46 33.42 35.68
sinc4_1 7.85 11.32 12.02
intel4_1 7.64 13.98 13.43
fun4_1 9.96 18.74 18.11
amb4_1 6.69 10.64 9.18
shar4_1 7.64 12.05
attr2_1 21.34 31.53 35.00
<!--/html_preserve-->

We can also "visualize" the segments using snake plots, one line per cluster. For example, we can plot the means of the profiling variables for each of our clusters to better see the differences between segments. For better visualization we plot the standardized profiling variables.
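A minimal sketch of such a snake plot, again using toy stand-in data (in the report this would use the standardized profiling variables and the cluster memberships from the segmentation step):

```r
# Toy stand-in data, as elsewhere in this sketch
set.seed(1)
ProjectData_profile <- matrix(sample(1:7, 40, replace = TRUE), ncol = 4,
                              dimnames = list(NULL, paste0("q", 1:4)))
cluster_memberships <- rep(1:2, each = 5)
cluster_ids <- sort(unique(cluster_memberships))

# Standardize each profiling variable (z-scores), then average per segment
ProjectData_scaled <- scale(ProjectData_profile)
Cluster_Profile_standar_mean <- sapply(cluster_ids, function(i)
  colMeans(ProjectData_scaled[cluster_memberships == i, , drop = FALSE]))
colnames(Cluster_Profile_standar_mean) <- paste("Seg", cluster_ids)

# One line per segment across the profiling variables
matplot(Cluster_Profile_standar_mean, type = "l", lty = 1,
        xlab = "Profiling variable", ylab = "Standardized mean")
legend("topleft", legend = colnames(Cluster_Profile_standar_mean),
       col = seq_along(cluster_ids), lty = 1)
```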


We can also compare the averages of the profiling variables of each segment to the averages of those variables across the whole population. This can also help us judge whether there are indeed clusters in our data (e.g. if all segments look much like the overall population, there may be no real segments). For example, we can measure, for each segment and variable, the ratio of the cluster average to the population average, minus 1 (i.e. avg(cluster)/avg(population) - 1):
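This ratio-to-population profile can be sketched in one line once the segment and population means are available; toy stand-in data are used below in place of the actual profiling data:

```r
# Toy stand-in data, as in the earlier sketches
set.seed(1)
ProjectData_profile <- matrix(sample(1:7, 40, replace = TRUE), ncol = 4,
                              dimnames = list(NULL, paste0("q", 1:4)))
cluster_memberships <- rep(1:2, each = 5)
cluster_ids <- sort(unique(cluster_memberships))

Cluster_Profile_mean <- sapply(cluster_ids, function(i)
  colMeans(ProjectData_profile[cluster_memberships == i, , drop = FALSE]))
colnames(Cluster_Profile_mean) <- paste0("Seg.", cluster_ids)
population_average <- colMeans(ProjectData_profile)

# avg(cluster) / avg(population) - 1: values far from 0 flag variables
# on which a segment differs markedly from the overall population
Cluster_Profile_ratios <- round(Cluster_Profile_mean / population_average - 1, 2)
Cluster_Profile_ratios
```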

<!--html_preserve-->

Seg.1 Seg.2 Seg.3 Seg.4 Seg.5 Seg.6 Seg.7
id
gender 0.00 0.01 -0.01 0.00 0.00 0.00 0.00
idg -0.14 -0.07 0.00 0.05 0.11 -0.29 0.30
condtn -0.12 0.00 -0.07 0.02 0.09 -0.05 0.09
wave -0.82 -0.53 -0.21 0.01 0.28 0.60 0.85
round -0.14 -0.07 0.00 0.05 0.11 -0.28 0.30
position -0.09 -0.07 -0.02 0.04 0.11 -0.25 0.27
positin1
order -0.13 -0.06 0.00 0.04 0.11 -0.27 0.29
partner -0.13 -0.06 -0.01 0.05 0.10 -0.27 0.28
pid
match -0.10 0.21 -0.02 -0.15 0.09 0.00 -0.10
int_corr
samerace 0.02 0.01 -0.05 -0.04 0.25 -0.16 -0.12
age_o
race_o
pf_o_att
pf_o_sin
pf_o_int
pf_o_fun
pf_o_amb
pf_o_sha
dec_o -0.04 0.05 0.05 -0.06 0.04 0.07 -0.12
attr_o
sinc_o
intel_o
fun_o
amb_o
shar_o
like_o
prob_o
met_o
age
field 0.24 -0.12 0.02 0.01 -0.13 0.12 -0.04
field_cd
undergra -0.99 -0.99 -0.78 0.75 0.51 0.79 0.71
mn_sat -0.94 -0.94 -0.68 0.86 0.57 0.43 0.58
tuition -0.96 -0.96 -0.69 0.68 0.60 0.78 0.55
race
imprace
imprelig
from -0.07 -0.07 0.10 0.10 0.04 -0.07 -0.08
zipcode 0.16 0.01 -0.09 -0.27 0.10 0.11 0.08
income -0.07 0.15 0.21 -0.20 0.05 0.01 -0.17
goal
date
go_out
career -0.13 0.06 -0.16 0.13 0.01 0.07 -0.05
career_c
sports
tvsports
exercise
dining
museums
art
hiking
gaming
clubbing
reading
tv
theater
movies
concerts
music
shopping
yoga
exphappy
expnum
attr1_1
sinc1_1
intel1_1
fun1_1
amb1_1
shar1_1
attr4_1
sinc4_1
intel4_1
fun4_1
amb4_1
shar4_1
attr2_1
<!--/html_preserve-->

Questions

  1. What do the numbers in the last table indicate? Which numbers are the most informative?
  2. Based on the tables and snake plot above, what are the key features of each segment in this solution?

Answers

* * * * * * * * * *

Step 8: Robustness Analysis

We should also assess the robustness of our analysis by changing the clustering method and its parameters. Once we are comfortable with the solution, we can finally answer our initial business questions:

Questions

  1. How many segments are there in our market? How many do you select and why? Try a few and explain your final choice based on a) statistical arguments, b) on interpretation arguments, c) on business arguments (you need to consider all three types of arguments)
  2. Can you describe the segments you found based on the profiles?
  3. What happens if you change the number of factors and, in general, iterate the whole analysis? Iteration is key in data science.
  4. Can you now answer the Boats case questions? What business decisions do you recommend to this company based on your analysis?

Answers

* * * * * * * * * *


\clearpage

Part 3: Purchase Drivers

We will now use classification analysis methods to understand the key purchase drivers for boats (a similar analysis can be done for recommendation drivers). For simplicity we do not follow the "generic" steps of classification discussed in that reading, and only consider the classification and purchase-drivers analysis for the segments found above.

We are interested in understanding the purchase drivers, hence our dependent variable is column 82 of the Boats data (attr2_1) - why is that? For now we use only the subquestions of Question 16 of the case, and select some of the parameters for this part of the analysis:

# Please ENTER the class (dependent) variable. Please use numbers, not
# column names! e.g. 82 uses the 82nd column as the dependent variable.  YOU
# NEED TO MAKE SURE THAT THE DEPENDENT VARIABLE TAKES ONLY 2 VALUES: 0 and
# 1!!!
dependent_variable = 82

# Please ENTER the attributes to use as independent variables. Please use
# numbers, not column names! e.g. c(1:5, 7, 8) uses columns 1,2,3,4,5,7,8
independent_variables = c(54:80)  # use 54-80 for boats

# Please ENTER the profit/cost values for the correctly and wrongly
# classified data:
actual_1_predict_1 = 100
actual_1_predict_0 = -75
actual_0_predict_1 = -50
actual_0_predict_0 = 0

# Please ENTER the probability threshold above which an observation is
# predicted as class 1:
Probability_Threshold = 50  # between 1 and 99%

# Please ENTER the percentage of data used for estimation
estimation_data_percent = 80
validation_data_percent = 10

# Please enter 0 if you want to 'randomly' split the data in estimation and
# validation/test
random_sampling = 0

# Tree parameter: PLEASE ENTER the Tree (CART) complexity control cp (e.g.
# 0.001 to 0.02, depending on the data)
CART_cp = 0.01

# Please enter the minimum size a segment must have for the analysis to be
# run separately for that segment
min_segment = 100
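To make the role of the profit/cost parameters and the probability threshold concrete, here is a minimal sketch of how they combine to score a set of classifications. The true classes and predicted probabilities below are hypothetical, purely for illustration:

```r
# Parameters as entered above
Probability_Threshold <- 50 / 100
actual_1_predict_1 <- 100; actual_1_predict_0 <- -75
actual_0_predict_1 <- -50; actual_0_predict_0 <- 0

# Hypothetical true classes and model-predicted probabilities of class 1
actual         <- c(1, 1, 0, 0, 1)
predicted_prob <- c(0.9, 0.3, 0.7, 0.2, 0.6)

# Classify as 1 whenever the predicted probability exceeds the threshold
predicted <- as.integer(predicted_prob > Probability_Threshold)

# Profit matrix: rows = actual class (0/1), columns = predicted class (0/1)
profit_matrix <- matrix(c(actual_0_predict_0, actual_0_predict_1,
                          actual_1_predict_0, actual_1_predict_1),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(actual = 0:1, predicted = 0:1))

# Total profit: look up each (actual, predicted) pair and sum
total_profit <- sum(profit_matrix[cbind(actual + 1, predicted + 1)])
total_profit  # 100 - 75 - 50 + 0 + 100 = 75
```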